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1.  Introduction 


The  n-body  problem,  a  well-established,  potentially  intractable  algorithm  with 
computational  complexity  of  o(«2),  is  a  critical  component  to  many  solutions  in 
fields  as  varied  as  biology,  chemistry,  physics,  and  engineering.1,2  Efforts  to 
manage  the  computational  cost  of  this  algorithm  have  resulted  in  the  development 
of  approximations  such  as  the  Bames-Hut  Fast  Multipole  Method  (FMM)  and  the 
Adaptive  Fast  Multipole  Method  (AFMM).3  5  The  Black-Box  Adaptive  Fast 
Multipole  Method  (bbAFMM)  used  in  this  work  is  a  variant  of  the  AFMM  and 
follows  a  similar  structure  but  without  the  explicit  definition  of  multipole 
expansions,  relying  instead  on  the  use  of  a  Chebyshev  interpolation  to  evaluate  a 
continuous  distribution  of  density  at  the  surface  of  each  box  that  encloses  a  set  of 
particles.6  The  result  is  an  algorithm  that  executes  with  a  lower  computational 
complexity  of  O(n)  for  n  particles  for  any  non-oscillatory  function  without 
sacrificing  the  efficacy  of  the  final  solution.7 

The  bbAFMM  algorithm  defines  a  set  of  8  independent  functions  or  kernels  that 
are  attractive  to  exploitation  within  the  high-performance  computing  (HPC) 
community  using  the  graphics  processor  unit  (GPU).8  This  work  examines  the 
particle-to-particle  (P2P)  kernel  using  an  algorithmic  analysis  that  focuses  not  just 
on  arithmetic  intensity,  but  GPU  memory  bandwidth,  GPU  peak  performance,  and 
peripheral  component  interconnect  express  (PCIe)  bandwidth  to  understand  the 
potential  performance  benefit  of  using  the  GPU  device  for  its  definition.  The 
resulting  algorithmic  analysis  is  then  compared  with  the  observed  results  of  the 
GPU-defined  P2P  kernel  we  developed  using  the  Compute  Unified  Device 
Architecture  (CUDA).9  A  brief  outline  of  the  rest  of  this  work  follows. 

The  next  section  will  discuss  the  background  of  the  bbAFMM  algorithm  including 
a  brief  description  of  the  8  defined  kernels  emphasizing  the  P2P  kernel  that  is  the 
focus  of  this  work.  Section  3  will  provide  the  algorithmic  analysis  of  the  P2P  kernel 
and  will  include  subsections  on  approximating  floating-point  operations,  counting 
bytes  as  well  as  GPU  peak,  memory,  and  PCIe  bandwidth,  as  they  are  critical  in  the 
final  analysis.  This  section  will  also  describe  the  computing  environment  employed 
in  this  work.  Section  4  describes  the  actual  implementation  of  the  P2P  kernel  using 
the  GPU  and  includes  the  resulting  observed  performance.  This  section  will 
examine  the  observed  results  in  relation  to  the  previous  section  on  algorithmic 
analysis.  The  last  section  presents  conclusions  and  future  work. 
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2.  Background 


The  bbAFMM  is  an  approximation  of  the  n-body  problem  and  structurally  similar 
to  AFMM  defining  8  separate  kernels  that  operate  within  a  tree  structure.10  Each 
level  /  + 1  in  the  tree  is  refined  recursively  from  level  l  by  subdividing  cubic  data 
structures  containing  source  particles  under  consideration  commonly  referred  to  as 
boxes.10,5  It  is  during  the  execution  of  these  recursive  phases  that  the  8  kernels  are 
called,  and  can  be  defined  as,  either  leaf  or  nonleaf  modes — i.e.,  execute  at  either 
the  leaf  or  nonleaf  levels  of  the  defined  tree  structure. 

The  multipole-to-multipole  (M2M),  multipole-to-local  (M2L),  particle-to-local 
(P2L),  and  local-to-local  (L2L)  are  defined  as  nonleaf,  while  the  particle-to- 
multipole  (P2M),  P2P,  multipole-to-particle  (M2P),  and  local-to-particle  (L2P)  are 
defined  as  leaf-mode  operations.  These  kernels  function  together  to  approximate 
the  global  solution  bbAFMM  through  far-  and  near- field  computations,  the  latter  of 
which  is  the  result  of  the  execution  of  the  P2P  kernel. 

The  P2P  kernel  calculates  the  contributions  of  particles  belonging  to  a  leaf  box  and 
its  neighboring  leaf  boxes,  and  is  analogous  to  the  all-pairs  interaction  given  by  the 
direct  n-body  solution.1,2  Letting  the  total  number  of  neighboring  clusters  at  a  given 
level  for  any  leaf  box  The  N(T),  the  computation  of  near- field  values  using  the  P2P 
kernel  is  given  mathematically  by  Eq.  1,  such  that  x,y  are  defined  as  a  well- 
separated  pair,11  i.e.,  not  sharing  a  boundary,  in  target  box  T  and  source  box 
S  ,  respectively,  for  kernel  K(x,y)  for  all  xi  e  T. 712 

/""'(*,)=  £  £4v,)v  a) 

ScN(T)yjcS 

It  follows  from  Eq.  1  that  the  GPU  should  be  leveraged  to  execute  the  P2P  kernel 
given  the  potential  high  floating-point  count  and  the  proclivity  of  the  device  to 
efficiently  consume  these  operations.2  However,  there  are  other  factors,  which  are 
discussed  in  the  following  section. 

3.  Algorithmic  Analysis  of  P2P 

This  section  will  discuss  the  actual  algorithmic  analysis  employed  by  this  work  and 
detail  the  methodology  used  to  determine  both  approximate  floating-point 
operation  counts  and  associated  bytes  moved  by  the  P2P  kernel.  This  section  also 
presents  the  actual  computing  environment  in  which  this  work  was  completed. 
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3.1  Computing  Environment  Employed 


The  computing  environment  used  for  this  work  is  a  64-node  heterogeneous  cluster 
consisting  of  48  IBM  dx360M4  nodes,  each  with  one  Intel  Phi  5110P  and  16 
dx360M4  nodes  each  with  one  NVIDIA  Kepler  K20M/K40M  GPU.  Each  node 
contained  dual  Intel  Xeon  E5-2670  (Sandy  Bridge)  central  processing  units 
(CPUs),  64  GB  of  memory,  and  a  Mellanox  FDR- 10  Infiniband  host  channel 
adaptor. 

However,  this  work  is  focused  on  the  behavior  of  a  single  kernel  and  as  such  does 
not  employ  multiple  processors.  This  work  makes  use  of  a  single  processing  core 
and  a  single  NVIDIA  Kepler  K40  GK110  architecture  with  2  PCIe  Gen  3.0  slots 
standard,  optional  2  double- widths  PCIe  for  GPUs  or  coprocessors.  The  defined 
metrics  for  the  hardware  are  1)  PCIe  bandwidth  (2  x  16  slot),  7.877  GFloat/s; 
Kepler  K40  peak,  4,290  x  1  billion  floating-point  operations  (GFLOPs),  and  288 
GB/s  Kepler  K40  memory  bandwidth.1314 

3.2  Calculating  FLOP  and  Byte  Counts 

The  calculation  of  FLOPs  and  bytes  moved  by  the  P2P  kernel  follows  from  the 
algorithm  employed  by  the  CPU  implementation  shown  in  Fig.  1  for  given  leaf  box 
T  with  “U-Fist”  the  current  list  of  neighbors  and  K  the  defined  kernel.  The  K 
kernel  defined  by  the  CPU  follows  the  algorithm  given  by  Fig.  2. 

Let  S  be  the  average  number  of  particles  per  leaf  node  and  U 

be  size  of  U-List  for  the  cluster  that  the  ith  particle 
resides  in. 

1.  FOR  4=0  TO  S 

2.  FOR  j  =  0  TO  U- 1 

3.  M^^Listlj])  //  pointer  to  cluster  of  neighbors 

4.  FOR  42  =  0  TO  ( M^^SIZE )  //  total  number  in  cluster 

5  .  f  [i]  =  f  [i]  +  (m-  — >  da  ta[i2]).  w  x  k(t  — »  da  ta[i],  M  ■  — >  da  ta[i2]) 

6  .  END  FOR 

7  .  END  FOR 

8.  FOR  42  =  4  +  1  TO  S  II  local  particles 

9  .  b[i]  =  f  [i]  +  (t  — >  da  ta[i2]).  w  x  k(t  — >  da  ta[i],  T  — >  da  ta[i2]) 

10  .  b[i]  =  f[i]+  (t  — »  data[i]).  w  x  k(t  — »  data[i2],T  — >  data[i]) 

11.  END  FOR 


Fig.  1  P2P  algorithm 
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Fig.  2  K  kernel  algorithm 


Given  that  specific  FLOP  counts  are  intrinsically  bound  to  hardware  design,  we 
take  an  asymptotic  approach  and  ignore  constants,  examining  instead  how  the 
algorithm  scales  for  given  inputs.15  The  analysis  completed  by  this  work  will  focus 
on  the  execution  of  the  P2P  kernel  for  a  single  leaf  node  N  ,  but  these  derivations 
can  be  extended  without  loss  of  generality  to  all  leaves  in  the  tree  structure 
employed  by  bbAFMM.  Analysis  of  the  P2P  algorithm  in  Fig.  1  reveals  several 
nested  loops  that  depend  both  on  the  number  of  neighboring  particle  clusters  and 
all  locally  defined  particles.  The  operations  that  occur  within  these  loops  compute 
the  values  for  the  neighboring  particles  fM  and  local  particles  fL  shown  in  Fig.  1 
lines  4-6  and  8-11,  respectively. 

The  mathematical  representation  of  the  P2P  algorithm  is  shown  as  Eq.  2  with  M  s‘ze 
being  the  total  number  of  particles  defined  in  cluster  U  _  List  at  index  j  . 


Z 


+  if- 

i2=i+l 


(2) 


Let  the  number  of  neighboring  particles  from  the  cluster  at  index  j  shown  in  Fig. 
line  4  be  averaged  as  Eq.  3,  which  assumes  a  worst-case  scenario. 


Ms,ze 

1V1  7=0 


■M]l\ 


M  size 

1V1  j  =  U-l 


U  - 1 


Following  asymptotic  analysis,15  the  operations  fM  and  fL  are  defined  as 
constants,  and  by  applying  Eq.  3,  Eq.  2  becomes  Eq.  4. 

z  z  Wo{i«2+s’)^o(itt’).  (4) 

j=0|_7=0  i'2=i'+1  J  i=o\j=0  J  i=0  \i2=i+l  J 


The  determination  of  the  total  number  of  bytes  moved  by  the  P2P  kernel  follows 
the  algorithm  presented  by  Fig.  1  but  only  counts  any  potential  movement  across 
the  PCIe  bus,  the  established  bottleneck  for  many  GPU -based  solutions.16,17 
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Limiting  the  byte  count  to  global  data  movements  removes  the  K  kernel  from  the 
equation  and  leaves  only  the  particles  defined  by  the  given  leaf  box  T  and  any 
neighbors  defined  by  Eq.  3  for  each  member  in  the  U-List.  The  resulting  asymptotic 
behavior  of  bytes  moved  for  the  P2P  kernel  is  shown  as  Eq.  5. 

O (US  +  S)=>  O {US) .  (5) 


3.3  Kernel  Arithmetic  Intensity 

The  arithmetic  intensity  is  the  ratio  of  FLOPs  (Eq.  4)  to  bytes  moved  (Eq.  5)  and  is 
directly  related  to  potential  GPU  performance  such  that  the  higher  the  ratio,  the 
better  the  performance  is  likely  to  be  using  the  GPU.  Clearly  the  arithmetic  intensity 
given  by  the  P2P  kernel  scales  as  a  product  of  the  average  number  of  particles  per 
leaf  node  S  and  the  average  number  of  particles  per  neighboring  clusters.  This 
computed  arithmetic  intensity  is  examined  in  relation  to  the  defined  GPU  memory 
bandwidth,  GPU  peak  performance,  and  PCIe  bandwidth  in  the  next  subsection. 


3.4  Hardware-Derived  Metrics 

There  are  2  main  issues  when  running  a  kernel  on  the  GPU:  proper  utilization  of 
the  GPU  and  the  cost  of  data  transfer  over  the  PCIe.  Given  a  kernel  with  high 
arithmetic  intensity  such  as  P2P,  proper  utilization  of  the  GPU  is  accomplished  with 
a  bigger  kernel  and  setting  the  number  of  threads  per  block  at  optimal  levels. 
However,  estimating  the  cost  of  data  transfer  over  the  PCIe  bus  is  more  involved. 

The  estimated  speed  at  which  the  PCIe  can  supply  the  data  necessary  for  the  P2P 
kernel  to  operate  at  optimal  levels  must  be  determined.  The  performance  of  the 
hardware  employed  in  this  work,  detailed  in  Section  3.1,  is  used  to  derive  the 
metrics  for  this  subsection.  The  estimated  PCIe  bandwidth  is  given  by  Eq.  6  and 
the  estimated  GPU  computational  speed  is  given  by  Eq.  7,  with  Kbyte  and  Kfiop  the 
number  of  bytes  and  FLOPs  for  the  P2P  kernel,  and  PCIebw  and  GPUfiop  the  PCIe 
bandwidth  and  GPU  FLOPs  peak. 


T  = 

1  PCIe  ~ 


Kbyte*W'9 

PCIe 


(6) 


T  = 

1  flop  - 


GPUflop 


(7) 


Computing  the  values  for  Eqs.  6  and  7  is  accomplished  by  factoring  out  U  for 
Eqs.  4  and  5  to  determine  the  approximate  FLOPs  and  bytes  moved  for  the  P2P 
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kernel.  The  approximated  kernel  FLOPs  and  bytes  moved  are  then  applied  to 
Eqs.  6  and  7  for  the  final  estimated  performance  results.  The  estimated  performance 
results  are  computed  for  increasing  numbers  of  particles  and  shown  in  the  Table, 
where  S  is  defined  as  the  number  of  particles. 

Table  Estimated  P2P  kernel  performance 


Number  of 
Particles 

T 

*  PCIe 

T 

flop 

T  -T 

1  PCIe  1  flop 

4,913 

6.24e-07 

5.63e-06 

5.001e-06 

9,261 

1.18e-06 

2.00e-05 

1.88e-05 

13,824 

1.75e-06 

4.45e-05 

4.28e-05 

24,389 

3.10e-06 

1.39e-04 

1.36e-04 

97,336 

1.24e-05 

2.21e-03 

2.20e-03 

The  Table  shows  that  for  increasing  numbers  of  particles,  the  time  it  takes  the  PCIe 
to  transfer  data  to  the  kernel  is  less  than  it  takes  the  device  to  process  the  results. 
This  indicates  an  operation  that  is  compute-bound  rather  than  memory-bound  and 
likely  a  good  candidate  for  definition  using  the  GPU. 

4.  Implementing  P2P  as  a  GPU  Kernel 

A  critical  aspect  to  building  an  efficient  P2P  GPU  kernel  was  to  keep  as  much  data 
as  possible  on  the  device  rather  than  multiple  calls  across  the  PCIe  bus.  This  was 
accomplished  using  vectors  to  both  store  actual  cluster  data  and  provide  indirection 
pointers  to  these  clusters.  These  indirection  and  data  clusters  are  briefly  described 
in  the  next  subsection. 

4.1  Kernel  Indirection  and  Data  Clusters 

The  set  of  indirection  vectors  that  are  passed  to  the  GPU  kernel  consist  of  the  global 
cluster  index,  the  associated  neighbor  index,  starting  point  for  neighbor,  and  the 
starting  point  for  local  called  cidx,  uidx,  ulist,  and  cpartjdx,  respectively  (see 
Fig.  3).  The  global  thread  index  is  computed  using  CUDA  and  used  to  determine 
the  current  local  and  associated  neighboring  clusters  with  computed  offsets 
cpartjdx  and  ulist,  respectively.  The  indirections  provide  a  means  to  map  proper 
algorithm  behavior  to  the  GPU,  as  the  device  will  only  see  data  within  a  given 
thread  block  without  regard  to  any  real  structure.  The  indirections  properly  isolate 
threads  to  defined  clusters  of  data.  These  indirections  for  local  and  neighbor  clusters 

are  then  applied  to  retrieve  the  actual  coordinate  data  held  with  the  particles 
structure.  The  number  of  threads  per  block  for  the  GPU  kernel  is  defined  based  on 
the  total  number  of  coordinate  particles  held  by  the  particles  structure. 
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Fig.  3  P2P  data  and  indirection  pointers 

The  P2P  GPU  kernel  executes  for  each  cluster,  storing  the  computed  potentials  for 
each  in  the  global  structure  that  was  copied  over  the  PCIe  bus.  The  algorithm  for 
the  P2P  GPU  kernel  is  shown  in  Fig.  4.  Once  the  kernel  completes,  the  CPU  will 
collect  each  potential  from  the  global  structure  and  copy  it  to  each  of  the  leaves  in 
the  current  tree  structure. 

The  observed  performance  of  the  implemented  P2P  GPU  kernel  and  its  relation  to 
the  predictive  analysis  are  discussed  in  the  next  subsection. 
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Let  N  be  total  number  of  particles,  P  be  collection  of  particles,  CIDX  be  the 
global  cluster  index,  UIDX  be  the  associated  neighbors  index,  and  CPART  _1DX 
be  the  starting  point  to  particle  index 
Output:  / 

1.  gidx  <=■  global _threadID  II  from  CUDA 

2.  IF  gidx  <  N  THEN 

3.  SET  F  <—  0 

4.  pi  <=  p[g/dx] 

5.  I  <=  CIDX[gidx] 

6.  FOR  j  <^UIDX[l]TOUIDX[l  +  l] 

7.  J  <-  ULIST[j] 

8 .  FOR  i2  <-  CPART  _  IDX  [j  ]  TO  CPART  _  IDX  [j  + 1] 

9.  pj  <^=  P[i2\ 

10.  F  <—  F  +  pj.wx  K(pi,  pj) 

11.  END  FOR 

12.  END  FOR 

13.  FOR  j  <—  CPART  _  IDX [/ ]  TO  gidx 

14.  pj  <=  P[j] 

15.  F  <-F  +  pj.wx  K(pi,  pj) 

16.  END  FOR 

17.  FOR  j  <—  gidx  +  1TO  CPART  _  IDX[l  + 1] 

18.  pj  <=  P[j] 

19.  F  <—  F  +  pj.wx  K(pi,  pj) 

20.  END  FOR 

21.  /  \gidx\  f  [gidx]  +  F 

22.  END  IF 


Fig.  4  P2P  GPU  kernel  algorithm 

4.2  Observed  Performance  and  Predictive  Analysis 

The  P2P  GPU  kernel  demonstrates  dramatic  performance  improvements  over  the 
CPU-only  implementation.  This  GPU  reveals  a  speed-up  factor  of  over  500  times 
greater  than  the  single-threaded  CPU  processor  version  for  close  to  100,000 
particles,  as  shown  in  Fig.  5.  These  results  are  congruent  with  the  predicted  analysis 
regarding  both  the  arithmetic  intensity  of  the  kernel  itself  and  the  hardware-derived 
metrics.  There  are  several  reasons  for  this  observed  performance  increase  when 
employing  the  GPU  over  the  serial  CPU. 
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/ - \ 

P2P  Kernel  Performance 


Total  Particles 


V _ / 


Fig.  5  GPU-defined  P2P  kernel  performance 

The  first  reason  is  the  most  obvious:  The  CPU  version  is  not  optimal  given  that  a 
single  process  is  being  employed  without  leveraging  any  of  the  multithread 
capabilities.  The  GPU  executes  thousands  of  cores  that  are  employed  for  even  the 
simplest  of  functions,  and  this  puts  the  CPU  at  a  severe  disadvantage  from  the 
start.13  Another  reason  for  the  substantial  performance  benefit  of  the  GPU  for  the 
P2P  kernel  is  that  both  the  locality  and  basic  operation  is  perfectly  suited  for  the 
data-throughput  model  of  the  device.  The  data  managed  by  the  kernel  is  executed 
in  unison  with  no  coalescing  or  bank  conflict  issues. 

5.  Conclusions  and  Future  Work 


This  work  documented  the  implementation  of  the  P2P  kernel  for  bbAFMM  using  a 
shared-memory  single  GPU  paradigm  with  CUDA  as  the  language  vehicle  and  has 
shown  dramatic  performance  increases  over  the  corresponding  CPU 
implementation.  The  P2P  GPU  kernel  revealed  a  speed-up  factor  of  more  than  500 
times  for  close  to  100,000  particles.  These  observed  results  are  congruent  with 
predictive  results  gleaned  from  both  algorithmic  analysis  and  hardware-derived 
metrics  that  include  GPU  memory  bandwidth,  GPU  peak  performance,  and  PCIe 
bandwidth. 

In  the  future  we  would  like  to  apply  these  algorithm  analysis  techniques  with  other 
kernels  defined  using  bbAFMM.  Particular  interest  resides  in  the  analysis  and 
implementation  of  the  M2L  kernel,  as  this  comprises  the  majority  of  bbAFMM  and 
would  present  the  largest  payoff. 
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List  of  Symbols,  Abbreviations,  and  Acronyms 


AFMM 

Adaptive  Fast  Multipole  Method 

bbAFMM 

Black-Box  Adaptive  Fast  Multipole  Method 

CPU 

central  processing  unit 

CUDA 

Compute  Unified  Device  Architecture 

FMM 

Fast  Multipole  Method 

FLOP 

floating-point  operation 

GFLOPs 

1  billion  floating-point  operations 

GPU 

graphics  processor  unit 

HPC 

high-performance  computing 

L2L 

local-to-local 

L2P 

local-to-particle 

M2L 

multipole-to-local 

M2M 

multipole-to-multipole 

M2P 

multipole-to-particle 

P2L 

particle-to-local 

P2M 

particle-to-multipole 

P2P 

particle-to-particle 

PCIe 

peripheral  component  interconnect  express 
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